# Import important libraries needed for functions
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(rmarkdown)
library(summarytools)
##
## Attaching package: 'summarytools'
##
## The following object is masked from 'package:tibble':
##
## view
library(dplyr)
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
The dataset is San Francisco dataset, from 2018 to present. The description of each column is -
Incident Datetime - The date and time when the incident occurred
Incident Date - The date the incident occurred
Incident Time- The time the incident occurred
Incident Year - The year the incident occurred, provided as a convenience for filtering
Incident Day of Week - The day of week the incident occurred
Report Datetime - Distinct from Incident Datetime, Report Datetime is when the report was filed.
Row ID - A unique identifier for each row of data in the dataset
Incident ID - This is the system generated identifier for incident reports. Incident IDs and Incident Numbers both uniquely identify reports, but Incident Numbers are used when referencing cases and report documents.
Incident Number - The number issued on the report, sometimes interchangeably referred to as the Case Number. This number is used to reference cases and report documents.
CAD Number - The Computer Aided Dispatch (CAD) is the system used by the Department of Emergency Management (DEM) to dispatch officers and other public safety personnel. CAD Numbers are assigned by the DEM system and linked to relevant incident reports (Incident Number).
Report Type Code - A system code for report types, these have corresponding descriptions within the dataset.
Report Type Description - The description of the report type, can be one of: Initial; Initial Supplement; Vehicle Initial; Vehicle Supplement; Coplogic Initial; Coplogic Supplement
Filed Online - Non- emergency police reports can be filed online by members of the public using SFPD’s self-service reporting system called Coplogic Values in this field will be “TRUE” if Coplogic was used to file the report.
Incident Code - Incident Codes are the system codes to describe a type of incident. A single incident report can have one or more incident types associated. In those cases you will see multiple rows representing a unique combination of the Incident ID and Incident Code.
Incident Category - A category mapped on to the Incident Code used in statistics and reporting. Mappings provided by the Crime Analysis Unit of the Police Department.
Incident Subcategory - A subcategory mapped to the Incident Code that is used for statistics and reporting. Mappings are provided by the Crime Analysis Unit of the Police Department.
Incident Description - The description of the incident that corresponds with the Incident Code. These are generally self-explanatory.
Resolution - The resolution of the incident at the time of the report. Can be one of: • Cite or Arrest Adult • Cite or Arrest Juvenile* • Exceptional Adult • Exceptional Juvenile* • Open or Active • Unfounded Juvenile information not maintained in the dataset.
Intersection - The 2 or more street names that intersect closest to the original incident separated by a backward slash.
CNN - The unique identifier of the intersection for reference back to other related basemap datasets.
Police District - The Police District where the incident occurred. District boundaries can be reviewed in the link below.
Analysis Neighborhood - This field is used to identify the neighborhood where each incident occurs. Neighborhoods and boundaries are defined by the Department of Public Health and the Mayor’s Office of Housing and Community Development.
Supervisor District - There are 11 members elected to the Board of Supervisors in San Francisco, each representing a geographic district. The Board of Supervisors is the legislative body for San Francisco. The districts are numbered 1 through 11.
Latitude - The latitude coordinate in WGS84, spatial reference is EPSG:4326
Longitude - The longitude coordinate in WGS84, spatial reference is EPSG:4326
Point - Geolocation in OGC WKT format (e.g, POINT(37.4,-122.3)
As we can see that this dataset is large, with many fields.To perform some infer analysis and infer from the data, we will need to perform some cleaning, tidying and identify which of these fields can be beneficial to us.
As we can see that this dataset is large, with many fields.To perform some infer analysis and infer from the data, we will need to perform some cleaning, tidying and identify which of these fields can be beneficial to us.
###Dataset
Reading in the data
# Read data from the csv file
data <- read.csv("_data/SF_Incident_Reports__2018_to_Present.csv")
The dimensions of the data
dim(data)
## [1] 669605 34
The datatype of each column
str(data)
## 'data.frame': 669605 obs. of 34 variables:
## $ Incident.Datetime : chr "2021/07/25 12:00:00 AM" "2022/06/28 11:58:00 PM" "2022/03/11 10:30:00 AM" "2021/05/15 05:47:00 PM" ...
## $ Incident.Date : chr "2021/07/25" "2022/06/28" "2022/03/11" "2021/05/15" ...
## $ Incident.Time : chr "00:00" "23:58" "10:30" "17:47" ...
## $ Incident.Year : int 2021 2022 2022 2021 2022 2021 2022 2022 2022 2018 ...
## $ Incident.Day.of.Week : chr "Sunday" "Tuesday" "Friday" "Saturday" ...
## $ Report.Datetime : chr "2021/07/25 01:41:00 PM" "2022/06/28 11:58:00 PM" "2022/03/11 08:03:00 PM" "2021/05/15 05:47:00 PM" ...
## $ Row.ID : num 1.06e+11 1.17e+11 1.13e+11 1.03e+11 1.17e+11 ...
## $ Incident.ID : int 1057189 1165543 1130480 1030518 1165351 1094352 1165462 1182772 1147129 669184 ...
## $ Incident.Number : int 216105573 220264913 226040232 210183345 220361741 216178097 226109026 226147327 210618041 186110767 ...
## $ CAD.Number : int NA NA NA NA NA NA NA NA NA NA ...
## $ Report.Type.Code : chr "II" "VS" "II" "VS" ...
## $ Report.Type.Description : chr "Coplogic Initial" "Vehicle Supplement" "Coplogic Initial" "Vehicle Supplement" ...
## $ Filed.Online : chr "true" "" "true" "" ...
## $ Incident.Code : int 6372 71012 71000 7043 7041 28150 6244 6374 7041 71000 ...
## $ Incident.Category : chr "Larceny Theft" "Other Offenses" "Lost Property" "Recovered Vehicle" ...
## $ Incident.Subcategory : chr "Larceny Theft - Other" "Other Offenses" "Lost Property" "Recovered Vehicle" ...
## $ Incident.Description : chr "Theft, Other Property, $50-$200" "License Plate, Recovered" "Lost Property" "Vehicle, Recovered, Motorcycle" ...
## $ Resolution : chr "Open or Active" "Open or Active" "Open or Active" "Open or Active" ...
## $ Intersection : chr "" "" "" "" ...
## $ CNN : num NA NA NA NA NA NA NA NA NA NA ...
## $ Police.District : chr "Southern" "Out of SF" "Central" "Out of SF" ...
## $ Analysis.Neighborhood : chr "" "" "" "" ...
## $ Supervisor.District : int NA NA NA NA NA NA NA NA NA NA ...
## $ Latitude : num NA NA NA NA NA NA NA NA NA NA ...
## $ Longitude : num NA NA NA NA NA NA NA NA NA NA ...
## $ Point : chr "" "" "" "" ...
## $ Neighborhoods : int NA NA NA NA NA NA NA NA NA NA ...
## $ ESNCAG...Boundary.File : int NA NA NA NA NA NA NA NA NA NA ...
## $ Central.Market.Tenderloin.Boundary.Polygon...Updated: int NA NA NA NA NA NA NA NA NA NA ...
## $ Civic.Center.Harm.Reduction.Project.Boundary : int NA NA NA NA NA NA NA NA NA NA ...
## $ HSOC.Zones.as.of.2018.06.05 : int NA NA NA NA NA NA NA NA NA NA ...
## $ Invest.In.Neighborhoods..IIN..Areas : logi NA NA NA NA NA NA ...
## $ Current.Supervisor.Districts : int NA NA NA NA NA NA NA NA NA NA ...
## $ Current.Police.Districts : int NA NA NA NA NA NA NA NA NA NA ...
Now let us check how many null values are present in the dataset.
sum(is.na(data))
## [1] 3441408
Let us check which columns have how many missing values.
missing_vals <- colSums(is.na(data))
missing_vals[sapply(missing_vals, function(x) any(x > 0))]
## CAD.Number
## 149698
## CNN
## 35458
## Supervisor.District
## 35458
## Latitude
## 35458
## Longitude
## 35458
## Neighborhoods
## 48996
## ESNCAG...Boundary.File
## 662313
## Central.Market.Tenderloin.Boundary.Polygon...Updated
## 583849
## Civic.Center.Harm.Reduction.Project.Boundary
## 584097
## HSOC.Zones.as.of.2018.06.05
## 529243
## Invest.In.Neighborhoods..IIN..Areas
## 669605
## Current.Supervisor.Districts
## 35571
## Current.Police.Districts
## 36204
As we can see, there are some columns with more than 5 lakh nulll values. We can handle that when we are cleaning the data
Dataframe summary
print(summarytools::dfSummary(data,
varnumbers = FALSE,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.70,
valid.col = FALSE),
method = 'render',
table.classes = 'table-condensed')
| Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Incident.Datetime [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Incident.Date [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Incident.Time [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Incident.Year [integer] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Incident.Day.of.Week [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Report.Datetime [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Row.ID [numeric] |
|
669605 distinct values | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Incident.ID [integer] |
|
558805 distinct values | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Incident.Number [integer] |
|
482981 distinct values | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| CAD.Number [integer] |
|
382364 distinct values | 149698 (22.4%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Report.Type.Code [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Report.Type.Description [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Filed.Online [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Incident.Code [integer] |
|
836 distinct values | 0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Incident.Category [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Incident.Subcategory [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Incident.Description [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Resolution [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Intersection [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| CNN [numeric] |
|
6466 distinct values | 35458 (5.3%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Police.District [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Analysis.Neighborhood [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Supervisor.District [integer] |
|
11 distinct values | 35458 (5.3%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Latitude [numeric] |
|
6466 distinct values | 35458 (5.3%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Longitude [numeric] |
|
6466 distinct values | 35458 (5.3%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Point [character] |
|
|
0 (0.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Neighborhoods [integer] |
|
116 distinct values | 48996 (7.3%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ESNCAG...Boundary.File [integer] | 1 distinct value |
|
662313 (98.9%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Central.Market.Tenderloin.Boundary.Polygon...Updated [integer] | 1 distinct value |
|
583849 (87.2%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Civic.Center.Harm.Reduction.Project.Boundary [integer] | 1 distinct value |
|
584097 (87.2%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HSOC.Zones.as.of.2018.06.05 [integer] |
|
|
529243 (79.0%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Invest.In.Neighborhoods..IIN..Areas [logical] |
|
669605 (100.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Current.Supervisor.Districts [integer] |
|
11 distinct values | 35571 (5.3%) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Current.Police.Districts [integer] |
|
|
36204 (5.4%) |
Generated by summarytools 1.0.1 (R version 4.2.2)
2022-12-22
head(data)
## Incident.Datetime Incident.Date Incident.Time Incident.Year
## 1 2021/07/25 12:00:00 AM 2021/07/25 00:00 2021
## 2 2022/06/28 11:58:00 PM 2022/06/28 23:58 2022
## 3 2022/03/11 10:30:00 AM 2022/03/11 10:30 2022
## 4 2021/05/15 05:47:00 PM 2021/05/15 17:47 2021
## 5 2022/06/28 05:22:00 PM 2022/06/28 17:22 2022
## 6 2021/11/18 01:30:00 PM 2021/11/18 13:30 2021
## Incident.Day.of.Week Report.Datetime Row.ID Incident.ID
## 1 Sunday 2021/07/25 01:41:00 PM 105718906372 1057189
## 2 Tuesday 2022/06/28 11:58:00 PM 116554371012 1165543
## 3 Friday 2022/03/11 08:03:00 PM 113048071000 1130480
## 4 Saturday 2021/05/15 05:47:00 PM 103051807043 1030518
## 5 Tuesday 2022/06/28 05:22:00 PM 116535107041 1165351
## 6 Thursday 2021/11/18 04:24:00 PM 109435228150 1094352
## Incident.Number CAD.Number Report.Type.Code Report.Type.Description
## 1 216105573 NA II Coplogic Initial
## 2 220264913 NA VS Vehicle Supplement
## 3 226040232 NA II Coplogic Initial
## 4 210183345 NA VS Vehicle Supplement
## 5 220361741 NA VS Vehicle Supplement
## 6 216178097 NA II Coplogic Initial
## Filed.Online Incident.Code Incident.Category Incident.Subcategory
## 1 true 6372 Larceny Theft Larceny Theft - Other
## 2 71012 Other Offenses Other Offenses
## 3 true 71000 Lost Property Lost Property
## 4 7043 Recovered Vehicle Recovered Vehicle
## 5 7041 Recovered Vehicle Recovered Vehicle
## 6 true 28150 Malicious Mischief Vandalism
## Incident.Description Resolution Intersection CNN
## 1 Theft, Other Property, $50-$200 Open or Active NA
## 2 License Plate, Recovered Open or Active NA
## 3 Lost Property Open or Active NA
## 4 Vehicle, Recovered, Motorcycle Open or Active NA
## 5 Vehicle, Recovered, Auto Open or Active NA
## 6 Malicious Mischief, Vandalism to Property Open or Active NA
## Police.District Analysis.Neighborhood Supervisor.District Latitude Longitude
## 1 Southern NA NA NA
## 2 Out of SF NA NA NA
## 3 Central NA NA NA
## 4 Out of SF NA NA NA
## 5 Out of SF NA NA NA
## 6 Mission NA NA NA
## Point Neighborhoods ESNCAG...Boundary.File
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## Central.Market.Tenderloin.Boundary.Polygon...Updated
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## Civic.Center.Harm.Reduction.Project.Boundary HSOC.Zones.as.of.2018.06.05
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## Invest.In.Neighborhoods..IIN..Areas Current.Supervisor.Districts
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## Current.Police.Districts
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
The columns below have maximum of null values and wont be much benefical to us for analysis. Hence we can drop them. Civic.Center.Harm.Reduction.Project.Boundary HSOC.Zones.as.of.2018.06.05 Invest.In.Neighborhoods..IIN..Areas ESNCAG…Boundary.File Central.Market.Tenderloin.Boundary.Polygon…Updated Civic.Center.Harm.Reduction.Project.Boundary
cleaned_data <- select(data,-28,-29,-30,-31,-32,-33,-34)
head(cleaned_data)
## Incident.Datetime Incident.Date Incident.Time Incident.Year
## 1 2021/07/25 12:00:00 AM 2021/07/25 00:00 2021
## 2 2022/06/28 11:58:00 PM 2022/06/28 23:58 2022
## 3 2022/03/11 10:30:00 AM 2022/03/11 10:30 2022
## 4 2021/05/15 05:47:00 PM 2021/05/15 17:47 2021
## 5 2022/06/28 05:22:00 PM 2022/06/28 17:22 2022
## 6 2021/11/18 01:30:00 PM 2021/11/18 13:30 2021
## Incident.Day.of.Week Report.Datetime Row.ID Incident.ID
## 1 Sunday 2021/07/25 01:41:00 PM 105718906372 1057189
## 2 Tuesday 2022/06/28 11:58:00 PM 116554371012 1165543
## 3 Friday 2022/03/11 08:03:00 PM 113048071000 1130480
## 4 Saturday 2021/05/15 05:47:00 PM 103051807043 1030518
## 5 Tuesday 2022/06/28 05:22:00 PM 116535107041 1165351
## 6 Thursday 2021/11/18 04:24:00 PM 109435228150 1094352
## Incident.Number CAD.Number Report.Type.Code Report.Type.Description
## 1 216105573 NA II Coplogic Initial
## 2 220264913 NA VS Vehicle Supplement
## 3 226040232 NA II Coplogic Initial
## 4 210183345 NA VS Vehicle Supplement
## 5 220361741 NA VS Vehicle Supplement
## 6 216178097 NA II Coplogic Initial
## Filed.Online Incident.Code Incident.Category Incident.Subcategory
## 1 true 6372 Larceny Theft Larceny Theft - Other
## 2 71012 Other Offenses Other Offenses
## 3 true 71000 Lost Property Lost Property
## 4 7043 Recovered Vehicle Recovered Vehicle
## 5 7041 Recovered Vehicle Recovered Vehicle
## 6 true 28150 Malicious Mischief Vandalism
## Incident.Description Resolution Intersection CNN
## 1 Theft, Other Property, $50-$200 Open or Active NA
## 2 License Plate, Recovered Open or Active NA
## 3 Lost Property Open or Active NA
## 4 Vehicle, Recovered, Motorcycle Open or Active NA
## 5 Vehicle, Recovered, Auto Open or Active NA
## 6 Malicious Mischief, Vandalism to Property Open or Active NA
## Police.District Analysis.Neighborhood Supervisor.District Latitude Longitude
## 1 Southern NA NA NA
## 2 Out of SF NA NA NA
## 3 Central NA NA NA
## 4 Out of SF NA NA NA
## 5 Out of SF NA NA NA
## 6 Mission NA NA NA
## Point Neighborhoods
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
###Mutate data
We can split the column Incident Time column into Minute and hour, so we can manipulate the hour column in further analysis.
mutated_data <- cleaned_data %>%
separate('Incident.Time',c('Incident.Hour','Incident.Minute'),sep = ":")
mutated_data <- mutated_data %>%
mutate(Incident.Month = month(Incident.Datetime, label = T))
crime <- mutated_data
###QUESTIONS:
After observing the data, certain questions pop up in my mind.We can make observations about crime has increased over the years, how COVID has affected the crime rate, the district wise distribution of crime, what is the most frequent resolution in different districts,what are the top crimes and what time does it occur,what are the crime hotspots.
I will now go over these questions and try to plot interesting graphs so we can get some inference from it.
##Crime over the years:
#a)Distribution of crime over the years from 2018 - Present.
We can group by the Incident Year and plot the graph.
crime_per_year <- crime %>%
group_by(Incident.Year) %>%
summarise(total = n())
colnames(crime_per_year) <- c("Year","Total")
library(ggplot2)
theme_set(theme_classic())
# Plot
g <- ggplot(crime_per_year, aes(Year, Total))
g + geom_bar(stat="identity", width = 0.5, fill="blue") +
labs(title="Crime in San Francisco",
subtitle="2018-2022",
caption=" Total crime over the years") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))
From this graph we can observe that the crime has reduced during COVID years of 2020-2021, which can be an indicator that either the crime had reduced due to lockdown conditions or the reported crime had reduced during those years. We can also observe the crime decreasing slightly from 2018 to Present day.
#b) Distribution of crime per category
We can create a dataframe, by grouping by on the basis of Incident Category and also calculate the percentage.
crime_category <- sort(table(crime$Incident.Category),decreasing = TRUE)
crime_category <- data.frame(crime_category[crime_category> 10000])
colnames(crime_category) <- c("Category", "Frequency")
crime_category$Percentage <- crime_category$Frequency / sum(crime_category$Frequency)
crime_category
## Category Frequency Percentage
## 1 Larceny Theft 204121 0.33714656
## 2 Other Miscellaneous 47065 0.07773724
## 3 Malicious Mischief 45368 0.07493430
## 4 Assault 40680 0.06719114
## 5 Non-Criminal 40069 0.06618195
## 6 Burglary 37685 0.06224430
## 7 Motor Vehicle Theft 33794 0.05581753
## 8 Recovered Vehicle 25807 0.04262541
## 9 Fraud 21054 0.03477488
## 10 Lost Property 19705 0.03254674
## 11 Warrant 19623 0.03241130
## 12 Drug Offense 16105 0.02660062
## 13 Robbery 15000 0.02477549
## 14 Missing Person 14516 0.02397607
## 15 Suspicious Occ 13788 0.02277363
## 16 Disorderly Conduct 11057 0.01826284
library(ggplot2)
library(ggrepel)
bp<-ggplot(crime_category, aes(x=Category, y=Frequency, fill=Category)) + geom_bar(stat="identity") +
theme(axis.text.x=element_blank()) + geom_text_repel(data=crime_category, aes(label=Category))
bp
From the plot we can observe that count of Larceny Theft, is much much greater than the other categories.Disorderly conduct is the least frequent category.
#c) Time Series graph for Daily Crimes from 2018-2022
We can mutate the date and group by date, to get the number of crimes each day.
library(dplyr)
df_crime_daily <- crime %>%
mutate(Date = as.Date(Incident.Date, "%Y/%m/%d")) %>%
group_by(Date) %>%
summarize(count = n()) %>%
arrange(Date)
head(df_crime_daily)
## # A tibble: 6 × 2
## Date count
## <date> <int>
## 1 2018-01-01 545
## 2 2018-01-02 393
## 3 2018-01-03 429
## 4 2018-01-04 468
## 5 2018-01-05 447
## 6 2018-01-06 451
library(ggplot2)
library(scales)
plot <- ggplot(df_crime_daily, aes(x = Date, y = count)) +
geom_line(color = "#F2CA27", size = 0.1) +
geom_smooth(color = "#1A1A1A") +
# fte_theme() +
scale_x_date(breaks = date_breaks("1 year"), labels = date_format("%Y")) +
labs(x = "Date of Crime", y = "Number of Crimes", title = "Daily Crimes in San Francisco from 2018 – 2022")
plot
We can again oberve the dip during the COVID lockdown period, and in general observe around 300-500 crimes per day. With unual low as well as high spikes.
##Crime in District -wise
#a) District wise Crime Distribution
We can create a dataframe by grouping by the Police district.
crime_per_district <- crime %>%
group_by(Police.District) %>%
summarise(n = n())
colnames(crime_per_district) <- c("Police.District", "Total")
head(crime_per_district )
## # A tibble: 6 × 2
## Police.District Total
## <chr> <int>
## 1 Bayview 59192
## 2 Central 99986
## 3 Ingleside 50612
## 4 Mission 84307
## 5 Northern 91046
## 6 Out of SF 19675
g <- ggplot(crime_per_district, aes(Police.District, Total))
g + geom_bar(stat="identity", width = 0.5, fill="pink") +
labs(title="District-wise crime in San Francisco",
subtitle="2018-2022",
caption=" Total crime in each district") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))
From the graph, we can see Central having maximum number of crime and Out of SF is the least.Park can be said to be the safest district as it has the least amount of crime.
#b) Resolution rates in each district
g <- ggplot(crime, aes(Police.District))
g + geom_bar(aes(fill=Resolution), width = 0.5) +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Resolution in each District",
)
From this graph, we can observe that maximum number of cases in all the districts are open or active.Tenderloin has the most arrests and Park has the least. There are some unfounded resolutions in each District.
##Temporal Trends
#a) Hourly Crime Distribution
We can create a dataframe by grouping by Incident Hour.
crime_perhours <- crime %>%
group_by(Incident.Hour) %>%
summarise(TotalCrime = n())
head(crime_perhours)
## # A tibble: 6 × 2
## Incident.Hour TotalCrime
## <chr> <int>
## 1 00 36046
## 2 01 17951
## 3 02 15561
## 4 03 12636
## 5 04 10651
## 6 05 10034
ggplot(crime_perhours, aes(x = Incident.Hour, y = TotalCrime))+
geom_col(fill = "orange")+
theme_minimal()+
labs(
title = "Crime per hour, San Francisco 2018 - 2022",
x = "Hours",
y = "Total Crime"
)
We can observe that maximum crimes occur around 12 in the afternoon, then another spike at midnight. It then gradually decreases from 1 am - 7 am in the morning.
#b) Theft time Heatmap
Of the above hourly districution, let us focus on Larcent Theft and plot the heatmap.
df_theft_time <- crime %>%
filter(Incident.Category=="Larceny Theft")%>%
group_by(Incident.Day.of.Week, Incident.Hour) %>%
summarize(count = n())
head(df_theft_time)
## # A tibble: 6 × 3
## # Groups: Incident.Day.of.Week [1]
## Incident.Day.of.Week Incident.Hour count
## <chr> <chr> <int>
## 1 Friday 00 1336
## 2 Friday 01 600
## 3 Friday 02 459
## 4 Friday 03 417
## 5 Friday 04 291
## 6 Friday 05 300
plot <- ggplot(df_theft_time, aes(x = Incident.Hour, y = Incident.Day.of.Week, fill = count)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.6), legend.title = element_blank(), legend.position="top", legend.direction="horizontal", legend.key.width=unit(2, "cm"), legend.key.height=unit(0.25, "cm"), legend.margin=unit(-0.5,"cm"), panel.margin=element_blank()) +
labs(x = "Hour of Theft (Local Time)", y = "Day of Week of Theft", title = "Number of Thefts in San Francisco from 2018 – 2022, by Time of Theft") +
scale_fill_gradient(low = "white", high = "orange")
plot
From this heatmap, we see a spike in crimes over the weekends in the midnight. We can also seespikes during weekdays around the evenig time from 5pm-8-pm.
##Crime hotspots
#a) Map of San Francisco
library(ggmap)
sf = get_stamenmap(bbox = c(left = -122.5164, bottom = 37.7066, right = -122.3554, top = 37.8103),
maptype = c("toner-lite"), zoom = 13)
map = ggmap(sf)
map
#b) Plotting the first 500 random points on the map
map + geom_point(data = sample_n(crime, 500), aes(x = Longitude, y = Latitude))
We can already observe a clustering towards the city center.
#c) Density Plot
map +
stat_density2d( data = sample_frac(crime, 0.2), aes(x = Longitude, y = Latitude, fill = ..level.., alpha = ..level..), size = 1, bins = 50, geom = 'polygon') +
scale_fill_gradient('Crime\nDensity', low = 'blue', high = 'orange') +
scale_alpha(range = c(.2, .3), guide = FALSE) +
guides(fill = guide_colorbar(barwidth = 1.5, barheight = 10))
We can observe certain intersections and having higher density and the maximum density being at the city center area.The parks and grassland regions have no reported crimes.
###Conclusion
Thus with the help of the SAN FRANCISCO crime dataset, we could plot thee graphs and gain some insight.